Data Collection

Introduction

Statistics is fundamentally concerned with understanding the world through data. The first and most crucial step in any statistical investigation is the collection of data. The quality of our analysis, the validity of our conclusions, and the effectiveness of our decisions depend heavily on the quality of the data we collect. Just as a chef cannot prepare a good meal with rotten ingredients, a statistician or researcher cannot arrive at meaningful insights with poor or flawed data.

This chapter explores the foundational concepts of data collection. We will delve into the various sources from which data can be obtained, the different methods used to collect it, and the instruments, like questionnaires, that are prepared for this purpose. We will also examine the critical choice between conducting a comprehensive census versus a more manageable sample survey, and the potential errors that can arise in the process. Understanding these aspects is essential for anyone looking to conduct research, formulate policy, or simply be a critical consumer of information in a data-driven world.

What Are The Sources Of Data?

The source of data is a primary consideration in any statistical study. Depending on whether the data is collected firsthand by the investigator or obtained from an existing source, we classify data into two categories: primary and secondary.

Primary Data

Primary data is data that is collected for the first time by the investigator or researcher with a specific purpose or objective in mind. This data is original in character, gathered directly from the source. The process of collecting primary data is often time-consuming and expensive, but it provides information that is tailored specifically to the research question.

Example: If the Ministry of Education wants to understand the impact of the Mid-Day Meal scheme on student attendance in rural government schools, and it sends enumerators to schools to collect this information directly, the data gathered would be primary data.

Secondary Data

Secondary data refers to data that has already been collected by someone else for some other purpose and is available for use. The investigator simply uses this pre-existing data for their own study. This data is not original to the current investigator. Using secondary data is generally cheaper and quicker than collecting primary data.

Example: If a university researcher uses the data from the Census of India or reports published by the National Statistical Office (NSO) to analyse literacy rates across different states, they are using secondary data.

Sources of secondary data include:

Published Sources: Government publications (e.g., Census of India, NSO reports), reports of international bodies (e.g., World Bank, IMF), journals, newspapers, and research publications.
Unpublished Sources: Records maintained by private firms, research institutions, and government departments that are not published but may be accessible upon request.

Basis of Comparison	Primary Data	Secondary Data
Originality	Original, as it is collected firsthand by the investigator.	Not original, as it has been collected by someone else.
Cost	More expensive, as it involves the entire process of collection.	Relatively inexpensive, as it only involves accessing existing data.
Time	Very time-consuming.	Less time-consuming.
Suitability	More suitable and relevant as it is collected for the specific purpose of the study.	May not be perfectly suitable; definitions and scope might differ.
Precaution	Requires careful planning of the survey.	Requires careful scrutiny of the source, reliability, and suitability of the data.

How Do We Collect The Data?

The collection of primary data is a systematic process that involves several steps, from preparing the survey instrument to choosing the mode of data collection. A survey gathers information by putting a set of questions to a specific group of people.

Preparation Of Instrument

The most common type of instrument used in surveys is the questionnaire. A questionnaire is a list of questions designed to elicit information from respondents. The quality of a questionnaire is critical to the quality of the data collected. A well-designed questionnaire can encourage accurate and complete responses, while a poorly designed one can lead to confusion, non-response, and inaccurate data.

Qualities of a Good Questionnaire

Limited Questions: The questionnaire should contain as few questions as the objectives of the study allow.
Simplicity: The language should be simple, clear, and unambiguous.
Logical Sequence: Questions should be arranged in a logical order, moving from general to specific.
No Double-Negatives: Avoid questions with double negatives as they can be confusing (e.g., "Don't you think smoking should not be banned?").
No Leading Questions: Questions should not suggest a particular answer (e.g., "Don't you agree that the new policy is good for the country?").
Instructions: Clear instructions should be provided for filling out the questionnaire.
Pre-testing (Pilot Survey): The questionnaire should be tested on a small group before the main survey to identify any problems.

Questions can be closed-ended (e.g., multiple choice, yes/no) or open-ended (allowing respondents to answer in their own words).

Mode Of Data Collection

Once the questionnaire is ready, the investigator must decide on the mode of administering it. The main modes are:

1. Personal Interviews

In this method, the investigator (or a trained enumerator) meets the respondents face-to-face and asks the questions from the questionnaire. The enumerator fills in the answers.

Merits: High response rate, allows for clarification of questions, can be used even if the respondent is illiterate.
Demerits: Very expensive, time-consuming, and can be influenced by the interviewer's bias.

2. Mailing Questionnaire

Here, the questionnaires are sent to the respondents by mail, with a request to fill them out and return them by a specific date.

Merits: Least expensive, can cover a wide geographical area, and is free from interviewer bias.
Demerits: Very low response rate, cannot be used for illiterate respondents, and doubts cannot be clarified.

3. Telephone Interviews

In this method, the investigator contacts the respondent over the telephone and asks the questions.

Merits: Relatively low cost, less time-consuming than personal interviews, and can cover a wide area.
Demerits: Limited to respondents who have telephones, reactions cannot be observed, and not suitable for long or complex questions.

Pilot Survey

Before launching a large-scale survey, it is highly advisable to conduct a pilot survey or pre-test. This involves trying out the questionnaire on a small group of individuals. The purpose of a pilot survey is to:

Test the clarity and wording of the questions.
Assess the performance of the enumerators.
Estimate the time and cost required for the main survey.
Identify any practical problems in administering the survey.

The feedback from the pilot survey is used to refine the questionnaire and the overall survey plan, ensuring a smoother and more effective main survey.

Census And Sample Surveys

When collecting data, a researcher must decide whether to collect information from every single unit of the population or from a representative subset.

Census Or Complete Enumeration

A census is a survey that includes every single element of the population (or universe). A population is the complete set of all items or individuals under study. When we collect information from every household in India (as in the Census of India) or every student in a school, we are conducting a census.

Merits: Provides intensive information on every unit of the population and is free from sampling error, although non-sampling errors can still occur.

Demerits: Extremely expensive and time-consuming. It requires a vast administrative setup and is not feasible for many types of research.

Population And Sample

In most statistical investigations, conducting a census is impractical. Instead, we use a sample survey. A sample is a smaller, representative group selected from the population. The process of selecting a sample is called sampling. By studying the sample, we can draw conclusions or make inferences about the entire population.

Example: To find out the average income of the 20,000 households in a town, instead of visiting all 20,000, we could select a representative sample of 200 households and study their incomes. The result from the sample can then be used to estimate the average income for the entire town.
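The household example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 20,000 incomes are simulated (the log-normal shape and its parameters are invented, not real data), and the population mean is known here only because the whole population was generated in the code. In a real survey only the sample estimate would be available.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: monthly incomes for 20,000 households (simulated).
population = [round(rng.lognormvariate(10, 0.5)) for _ in range(20_000)]

# Select 200 households without replacement; every household is equally likely.
sample = rng.sample(population, 200)

population_mean = statistics.fmean(population)  # normally unknown to the researcher
sample_mean = statistics.fmean(sample)          # the estimate the survey would report

print(f"Population mean: {population_mean:,.0f}")
print(f"Sample estimate: {sample_mean:,.0f}")
```

Running the sketch with different seeds gives slightly different sample estimates, each usually close to the population mean but rarely equal to it.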

Figure: A large population with a smaller sample selected from it.

Random Sampling

For a sample to be representative, it must be selected impartially. Random sampling (more precisely, simple random sampling) is a method in which every unit of the population has an equal and independent chance of being selected in the sample. It is the most basic form of probability sampling. The two most common ways of drawing such a sample are:

Lottery Method: All items in the population are assigned a number on identical slips of paper. The slips are mixed thoroughly, and the required number of slips are drawn out blindly.
Random Number Tables: These are tables that have been generated to contain completely random digits. A researcher can use these tables to select the units for the sample without any bias.

Random sampling ensures that the sample is free from the personal bias of the investigator.
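The lottery method can be imitated on a computer, with a pseudo-random number generator standing in for the mixing of paper slips. The sketch below is a minimal illustration; the function name lottery_draw and its parameters are hypothetical, chosen only for this example.

```python
import random

def lottery_draw(population_size, sample_size, seed=None):
    """Draw unit numbers from 1..population_size without replacement."""
    rng = random.Random(seed)
    slips = list(range(1, population_size + 1))  # one numbered slip per unit
    rng.shuffle(slips)                           # mix the slips thoroughly
    return sorted(slips[:sample_size])           # draw the required number of slips

# Select 200 of 20,000 numbered units; the seed only makes the draw repeatable.
selected = lottery_draw(population_size=20_000, sample_size=200, seed=7)
print(selected[:10])
```

Because the draw depends only on the generator and not on the investigator's judgment, no unit is favoured over another.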

Exit Polls

An exit poll is a classic example of a sample survey. During elections, it is impossible to ask every voter who they voted for. Instead, news agencies and survey firms conduct exit polls by asking a randomly selected sample of voters as they leave the polling station. Based on the responses from this sample, they predict the election outcome for the entire population of voters.

Non-random Sampling

In non-random sampling, the selection of units is not based on chance but on factors like the judgment, convenience, or discretion of the investigator. Examples include convenience sampling (selecting easily accessible units) and quota sampling (filling fixed quotas for groups such as age or gender, with the choice of individuals within each group left to the investigator). While easier to conduct, these methods are prone to bias and the results cannot be reliably generalised to the whole population.

Sampling And Non-sampling Errors

In data collection, errors are almost unavoidable. These errors can be broadly classified into sampling and non-sampling errors. Understanding these errors is crucial for assessing the reliability of survey results.

Sampling Errors

A sampling error is the difference between the result obtained from studying a sample (the sample estimate) and the true result that would have been obtained from a census of the entire population (the population parameter). This error arises because a sample is only a part of the population and is unlikely to be a perfect reflection of it.

Example: If the true average height of all students in a college is 165 cm, but a sample of 50 students yields an average height of 167 cm, the 2 cm difference is a sampling error.

Key Point: Sampling error can be reduced by increasing the size of the sample. The larger the sample, the closer it is likely to be to the population, and hence the smaller the sampling error.
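A small simulation can make this key point concrete. The sketch below assumes a hypothetical population of 10,000 student heights (simulated from a normal distribution centred at 165 cm; these figures are invented for illustration) and measures, for several sample sizes, the average gap between the sample mean and the population mean over repeated random samples.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 student heights in cm (simulated).
population = [rng.gauss(165, 8) for _ in range(10_000)]
true_mean = statistics.fmean(population)

def average_sampling_error(sample_size, repeats=300):
    """Average absolute gap between sample mean and population mean."""
    errors = []
    for _ in range(repeats):
        sample = rng.sample(population, sample_size)
        errors.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(errors)

results = {n: average_sampling_error(n) for n in (25, 100, 400, 1600)}
for n, error in results.items():
    print(f"n = {n:4d}  average sampling error = {error:.2f} cm")
```

In this setup the average error shrinks as the sample grows, roughly halving each time the sample size is quadrupled, so gains in accuracy become progressively more expensive.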

Non-sampling Errors

Non-sampling errors are generally more serious than sampling errors because they can occur even in a census and cannot be reduced by taking a larger sample. They arise during the process of data acquisition and are not related to the act of sampling.

Major types of non-sampling errors include:

1. Sampling Bias

This is a type of non-sampling error (despite its name) that occurs when the sampling frame (the list from which the sample is drawn) is faulty or when the selection process is not truly random. This leads to a situation where some members of the population have no chance of being selected, or their chance is lower than others.
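The effect of a faulty sampling frame can be illustrated with a simulation. The sketch below rests on invented assumptions: a population of 20,000 households in which 60% own a telephone, telephone owners have higher incomes on average, and the frame lists only telephone owners. It then draws samples of increasing size from that frame.

```python
import random
import statistics

rng = random.Random(11)  # fixed seed so the sketch is repeatable

# Hypothetical population of (income, has_phone) pairs; all values are simulated.
population = []
for _ in range(20_000):
    has_phone = rng.random() < 0.6
    income = rng.lognormvariate(10.3 if has_phone else 9.7, 0.5)
    population.append((income, has_phone))

true_mean = statistics.fmean(income for income, _ in population)

# Faulty sampling frame: households without a telephone can never be selected.
frame = [income for income, has_phone in population if has_phone]

gaps = {}
for n in (400, 2_000, 10_000):
    sample = rng.sample(frame, n)
    gaps[n] = statistics.fmean(sample) - true_mean
    print(f"n = {n:6d}  estimate minus true mean = {gaps[n]:,.0f}")
```

Under these assumptions the estimate stays well above the true mean at every sample size, which is why a larger sample does not cure a biased selection process.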

2. Non-response Errors

This error occurs when some of the individuals selected for the sample do not respond to the survey. If the non-respondents are different from the respondents in some significant way, the final sample will not be representative of the population. For instance, in an income survey, high-income individuals may be less likely to respond, leading to an underestimation of the true average income.
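The income example can be sketched as a simulation. The response probabilities used here (40% for households above the median income, 80% for the rest) are assumptions made purely for illustration, as are the simulated incomes.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed so the sketch is repeatable

# Hypothetical incomes of the 5,000 households selected for the survey (simulated).
selected = [rng.lognormvariate(10, 0.6) for _ in range(5_000)]
median_income = statistics.median(selected)

def responds(income):
    """Assumed behaviour: richer households are less likely to respond."""
    probability = 0.4 if income > median_income else 0.8
    return rng.random() < probability

# Only the households that respond end up in the data.
respondents = [income for income in selected if responds(income)]

mean_selected = statistics.fmean(selected)        # what a full response would give
mean_respondents = statistics.fmean(respondents)  # what the survey actually observes

print(f"Households selected:  {len(selected)}")
print(f"Households responded: {len(respondents)}")
print(f"Mean income, all selected households: {mean_selected:,.0f}")
print(f"Mean income, respondents only:        {mean_respondents:,.0f}")
```

With these assumed response rates, the mean computed from respondents falls below the mean of all selected households, even though the original selection was random.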

3. Errors in Data Acquisition

These are errors that occur during the collection of information. They can be due to:

Recording Errors: The enumerator may record the wrong answer.
Respondent Errors: The respondent may provide incorrect information, either deliberately or due to a misunderstanding of the question.
Instrument Flaws: A poorly designed questionnaire can lead to inaccurate answers.

Census Of India And NSSO

In India, two of the most important national sources of large-scale statistical data are the Census of India and the surveys conducted by the National Statistical Office (NSO).

Census of India

The Census of India is a decennial (conducted every 10 years) complete enumeration of the Indian population. It is one of the largest administrative exercises in the world. It is conducted by the Office of the Registrar General and Census Commissioner, under the Ministry of Home Affairs.

The Census provides a wealth of information on various demographic and socio-economic characteristics of the population, including:

Population size, distribution, and density
Literacy rates
Sex ratio
Housing conditions and household amenities
Economic activity and occupation

This data is a vital source of secondary data for administrators, planners, and researchers and is used for policymaking, demarcation of constituencies, and allocation of funds.

National Statistical Office (NSO)

The National Statistical Office (NSO), under the Ministry of Statistics and Programme Implementation, is the nodal agency for all statistical activities in the country. The NSO was formed by the merger of the National Sample Survey Office (NSSO) and the Central Statistics Office (CSO).

The NSO conducts regular large-scale sample surveys on various socio-economic subjects. While the Census provides a snapshot once every ten years, the NSO surveys provide more frequent and detailed data on specific topics. Some of its key surveys include:

Surveys on household consumer expenditure (used for poverty estimation)
Surveys on employment and unemployment (Periodic Labour Force Survey - PLFS)

The data collected by the NSO is crucial for planning and policy formulation by the Government of India.

Conclusion

Data collection is the bedrock of statistical analysis and evidence-based decision-making. The choice between primary and secondary data, the method of collection—be it a comprehensive census or a targeted sample survey—and the design of the survey instrument are all critical decisions that shape the outcome of any study.

A well-planned data collection process, which uses random sampling to ensure representativeness and takes steps to minimise both sampling and non-sampling errors, is essential for generating reliable and valid data. In India, institutions like the Census of India and the NSO provide a treasure trove of data that helps us understand our society and economy. Ultimately, the ability to collect, interpret, and critically evaluate data is a vital skill for navigating the complexities of the modern world.